Pathogen Surveillance Report
Summary
This report is produced by the nf-core/pathogensurveillance pipeline.
- Report group: no_group_defined
- Sample count: 3
- Last updated: June 25, 2025
- Pipeline version: dev
Pipeline status
The pipeline finished execution. Warning or error messages describing problems encountered in the analysis might be provided below.
✅ No issues reported.
A list of issues reported by the pipeline during execution. When relevant, the sample IDs or reference IDs associated with the issue are included.
Input data
Identification
Initial identification
The following data provide tentative classifications of the samples based on exact matches of a subset of short DNA sequences. These are intended to be preliminary identifications. For more robust identifications based on whole genome sequences, see the “Phylogenetic context” section below.
Initial classification of 3 samples identified all of them as:
Eukaryota > Viridiplantae > Streptophyta > Magnoliopsida > Cucurbitales > Cucurbitaceae > Cucumis > Cucumis hystrix
This table shows the “highest scoring” tentative taxonomic classification for each sample. The included metrics indicate how closely each sample matches reference genomes in online databases and how likely these matches are to be valid.
- Sample: The sample ID submitted by the user.
- WKID: Weighted k-mer Identity, adjusted for genome size differences.
- ANI: An estimate of average nucleotide identity (ANI), derived from WKID and k-mer length.
- Completeness: The percentage of the reference genome represented in the query.
- Top Hit: The name of the reference genome most similar to each sample based on the scoring criteria used.
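As a rough illustration of how an ANI estimate can be derived from a k-mer identity and the k-mer length: if mismatches are assumed to be independent, a k-mer of length k is shared only when all k bases match, so the k-mer identity is approximately ANI^k and ANI is approximately identity^(1/k). The sketch below applies that approximation. The function name and the default k of 31 are illustrative assumptions, and the exact calculation used by sendsketch may differ.

```python
def ani_from_kmer_identity(kmer_identity: float, k: int = 31) -> float:
    """Approximate ANI from the fraction of shared k-mers.

    Assumes mismatches are independent, so a k-mer is shared with
    probability ANI ** k. This is a simplification and not necessarily
    the formula used by sendsketch.
    """
    if not 0.0 <= kmer_identity <= 1.0:
        raise ValueError("kmer_identity must be between 0 and 1")
    if k < 1:
        raise ValueError("k must be a positive integer")
    return kmer_identity ** (1.0 / k)


if __name__ == "__main__":
    # Hypothetical WKID values, for illustration only.
    for wkid in (1.0, 0.5, 0.1):
        print(f"WKID={wkid:.2f} -> ANI~{ani_from_kmer_identity(wkid):.4f}")
```

Because of the 1/k exponent, even a modest k-mer identity can correspond to a high nucleotide identity, which is one reason the two columns can look very different for the same sample.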
Most similar organisms
This table shows the Average Nucleotide Identity (ANI) between each sample and the two references most similar to it based on this measure. ANI measures how similar the shared portion of two genomes is. Because only the shared portion is compared, differences such as extra plasmids or chromosomal duplications are not taken into account.
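The selection of the two most similar references per sample can be sketched as follows. This is a minimal illustration, not the pipeline's own code: the function name, the sample and reference IDs, and the ANI values are all hypothetical.

```python
from typing import Dict, List, Tuple


def top_references(
    ani: Dict[str, Dict[str, float]], n: int = 2
) -> Dict[str, List[Tuple[str, float]]]:
    """Return the n references with the highest ANI for each sample."""
    result = {}
    for sample, scores in ani.items():
        # Sort references from most to least similar.
        ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
        result[sample] = ranked[:n]
    return result


if __name__ == "__main__":
    # Made-up ANI values (percent) for illustration only.
    example = {
        "sample_A": {"ref_1": 98.7, "ref_2": 91.2, "ref_3": 95.4},
        "sample_B": {"ref_1": 90.1, "ref_2": 99.3, "ref_3": 92.8},
    }
    for sample, hits in top_references(example).items():
        print(sample, hits)
```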
This plot shows the results of comparing the similarity of all samples and references to each other. These similarity metrics are based on the presence and abundance of short exact sequence matches between samples (i.e. comparisons of k-mer sketches). These measurements are not as reliable as the methods used to create phylogenetic trees, but may be useful if phylogenetic trees could not be inferred for these samples.
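To illustrate what a similarity based on exact k-mer matches means, the sketch below computes the Jaccard similarity between the full k-mer sets of short sequences. It is a simplified stand-in: tools such as sourmash compare subsampled sketches rather than full k-mer sets and also account for reverse complements, both of which are omitted here. The function names, sequences, and the k value are illustrative assumptions.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all k-mers in a sequence (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Shared k-mers divided by total distinct k-mers (0.0 if both empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def similarity_matrix(seqs: Dict[str, str], k: int) -> Dict[Tuple[str, str], float]:
    """Jaccard similarity for every unordered pair of named sequences."""
    sets = {name: kmers(seq, k) for name, seq in seqs.items()}
    return {
        (x, y): jaccard(sets[x], sets[y])
        for x, y in combinations(sorted(sets), 2)
    }


if __name__ == "__main__":
    # Toy sequences for illustration only.
    toy = {"s1": "ACGTACGTAC", "s2": "ACGTACGTTC", "ref": "TTTTGGGGCC"}
    for pair, score in similarity_matrix(toy, k=4).items():
        print(pair, round(score, 3))
```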
POPC not calculated.
Phylogenetic context
Shown are phylogenetic trees of samples together with reference sequences downloaded from RefSeq, meant to provide a reliable identification using genome-scale data. The accuracy of this identification depends on the presence of close reference sequences in RefSeq and the accuracy of the initial identification.
Multigene phylogeny
This is a multigene phylogeny of samples with reference genomes for context. It is the most robust identification provided by this pipeline, but taxonomic coverage is still limited by the availability of similar reference sequences.
Genetic diversity
SNP trees
This is a representation of a Single Nucleotide Polymorphism (SNP) tree, depicting the genetic relationships among samples in comparison to a reference assembly.
The tree is less robust than a core gene phylogeny and cannot offer insights on evolutionary relationships among strains, but it does offer one way to visualize the genetic diversity among samples, with genetically similar strains clustering together.
Question: does it make sense to show the reference within the tree?
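The intuition behind genetically similar strains clustering together in the SNP tree described above can be illustrated with pairwise SNP counts over an alignment of variant sites. This is only a sketch under simple assumptions: the tree itself is inferred with a phylogenetic method rather than from raw counts, and the function names, sample IDs, and sequences below are hypothetical.

```python
from typing import Dict


def snp_distance(a: str, b: str, missing: str = "N-") -> int:
    """Count sites where two aligned sequences differ.

    Sites where either sequence has a missing or gap character are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for x, y in zip(a.upper(), b.upper())
        if x != y and x not in missing and y not in missing
    )


def distance_matrix(alignment: Dict[str, str]) -> Dict[str, Dict[str, int]]:
    """Symmetric matrix of pairwise SNP counts, keyed by sequence name."""
    names = sorted(alignment)
    return {
        x: {y: snp_distance(alignment[x], alignment[y]) for y in names}
        for x in names
    }


if __name__ == "__main__":
    # Toy alignment of variant sites, for illustration only.
    toy = {"sample_1": "ACGTAC", "sample_2": "ACGTAT", "reference": "TCGAAC"}
    for name, row in distance_matrix(toy).items():
        print(name, row)
```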
Minimum spanning network
This figure depicts a minimum spanning network (MSN). The nodes represent unique multilocus genotypes, and the size of each node is proportional to the number of samples that share the same genotype.
The edges represent the SNP differences between two given genotypes, and the darker the color of the edges, the fewer SNP differences between the two.
Note: within these MSNs, edge lengths are not proportional to SNP differences.
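The idea of linking genotypes by their smallest SNP differences can be sketched with Prim's algorithm on a pairwise distance matrix. This is an illustrative simplification: it returns a single minimum spanning tree, whereas a minimum spanning network may additionally keep equally short alternative edges. The genotype names and distances are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def minimum_spanning_tree(
    dist: Dict[str, Dict[str, int]]
) -> List[Tuple[str, str, int]]:
    """Return MST edges (from, to, snp_difference) using Prim's algorithm.

    `dist` must be a complete symmetric matrix keyed by genotype name.
    """
    nodes = sorted(dist)
    if not nodes:
        return []
    start = nodes[0]
    visited = {start}
    # Candidate edges leaving the visited set, ordered by SNP difference.
    heap = [(dist[start][other], start, other) for other in nodes if other != start]
    heapq.heapify(heap)
    edges = []
    while heap and len(visited) < len(nodes):
        d, u, v = heapq.heappop(heap)
        if v in visited:
            continue  # Edge would create a cycle; skip it.
        visited.add(v)
        edges.append((u, v, d))
        for w in nodes:
            if w not in visited:
                heapq.heappush(heap, (dist[v][w], v, w))
    return edges


if __name__ == "__main__":
    # Made-up SNP differences between three genotypes.
    toy = {
        "G1": {"G1": 0, "G2": 1, "G3": 4},
        "G2": {"G1": 1, "G2": 0, "G3": 2},
        "G3": {"G1": 4, "G2": 2, "G3": 0},
    }
    for u, v, d in minimum_spanning_tree(toy):
        print(f"{u} -- {v}: {d} SNP(s)")
```

In this toy case the tree connects G1 to G2 and G2 to G3, leaving out the longer G1 to G3 edge.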
Software and references
Methods
The pathogen surveillance pipeline used the following tools that should be referenced as appropriate:
- A sample is first identified to genus using sendsketch and further identified to species using sourmash (Brown and Irber 2016).
- The Nextflow data-driven computational pipeline enables deployment of complex parallel and reactive workflows (Di Tommaso et al. 2017).
Analysis software
| module | program | version | citation |
|---|---|---|---|
| ASSIGN_BUSCO_REFERENCES | r-base | 4.2.1 | R Core Team (2021) |
| ASSIGN_MAPPING_REFERENCE | r-base | 4.2.1 | R Core Team (2021) |
| BAKTA_BAKTADBDOWNLOAD | bakta | 1.10.4 | Schwengers et al. (2021) |
| BUSCO_BUSCO | busco | 5.8.3 | Manni et al. (2021) |
| BUSCO_DOWNLOAD | busco | 5.8.3 | Manni et al. (2021) |
| BWA_INDEX | bwa | 0.7.18-r1243-dirty | Li and Durbin (2009) |
| BWA_MEM | bwa | 0.7.18-r1243-dirty | Li and Durbin (2009) |
| BWA_MEM | samtools | 1.2 | Danecek et al. (2021) |
| FASTP | fastp | 0.23.4 | Chen (2023) |
| FASTQC | fastqc | 0.12.1 | Andrews et al. (2010) |
| GATK4_VARIANTFILTRATION | gatk4 | 4.6.1.0 | Van der Auwera and O’Connor (2020) |
| GRAPHTYPER_GENOTYPE | graphtyper | 2.7.7 | Eggertsson et al. (2017) |
| GRAPHTYPER_VCFCONCATENATE | graphtyper | 2.7.7 | Eggertsson et al. (2017) |
| IQTREE_BUSCO | iqtree | 2.4.0 | Nguyen et al. (2015) |
| IQTREE_SNP | iqtree | 2.4.0 | Nguyen et al. (2015) |
| MAFFT_BUSCO | mafft | 7.52 | Katoh et al. (2002) |
| MAFFT_BUSCO | pigz | 2.8 | NA |
| PICARD_CREATESEQUENCEDICTIONARY | picard | 3.3.0 | “Picard Toolkit” (2019) |
| PICARD_FORMAT | picard | 3.3.0 | “Picard Toolkit” (2019) |
| QUAST | quast | 5.3.0 | Mikheenko et al. (2018) |
| SAMPLESHEET_CHECK | r-PathoSurveilR | 1.1.4 | NA |
| SAMTOOLS_FAIDX | samtools | 1.21 | Danecek et al. (2021) |
| SAMTOOLS_INDEX | samtools | 1.21 | Danecek et al. (2021) |
| SOURMASH_COMPARE | sourmash | 4.8.14 | Brown and Irber (2016) |
| SOURMASH_SKETCH | sourmash | 4.8.14 | Brown and Irber (2016) |
| SPADES | spades | 4.0.0 | Prjibelski et al. (2020) |
| SUBSET_BUSCO_GENES | r-base | 4.2.1 | R Core Team (2021) |
| TABIX_BGZIP | tabix | 1.2 | Li (2011) |
| TABIX_TABIX | tabix | 1.2 | Li (2011) |
| VCFLIB_VCFFILTER | vcflib | 1.0.3 | Garrison et al. (2022) |
| VCF_TO_SNP_ALIGN | r-base | 4.2.1 | R Core Team (2021) |
| Workflow | nf-core/pathogensurveillance | v1.0.0 | |
| Workflow | Nextflow | 24.10.0 | Di Tommaso et al. (2017) |
R packages used
R version 4.5.0 (2025-04-11)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0 LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] PathoSurveilR_0.4.0
loaded via a namespace (and not attached):
[1] mnormt_2.1.1 gridExtra_2.3 polysat_1.7-7
[4] phangorn_2.12.1 permute_0.9-7 rlang_1.1.6
[7] magrittr_2.0.3 ade4_1.7-23 compiler_4.5.0
[10] mgcv_1.9-1 vctrs_0.6.5 maps_3.4.3
[13] reshape2_1.4.4 combinat_0.0-8 quadprog_1.5-8
[16] stringr_1.5.1 pkgconfig_2.0.3 fastmap_1.2.0
[19] labeling_0.4.3 ca_0.71.1 promises_1.3.3
[22] rmarkdown_2.29 purrr_1.0.4 xfun_0.52
[25] cachem_1.1.0 seqinr_4.2-36 aplot_0.2.5
[28] clusterGeneration_1.3.8 jsonlite_2.0.0 later_1.4.2
[31] adegenet_2.1.11 cluster_2.1.8 parallel_4.5.0
[34] R6_2.6.1 bslib_0.9.0 stringi_1.8.7
[37] RColorBrewer_1.1-3 boot_1.3-31 jquerylib_0.1.4
[40] numDeriv_2016.8-1.1 Rcpp_1.0.14 assertthat_0.2.1
[43] iterators_1.0.14 knitr_1.50 optimParallel_1.0-2
[46] base64enc_0.1-3 splines_4.5.0 httpuv_1.6.16
[49] rentrez_1.2.3 Matrix_1.7-2 igraph_2.1.4
[52] tidyselect_1.2.1 yaml_2.3.10 viridis_0.6.5
[55] vegan_2.6-10 RcppSimdJson_0.1.13 TSP_1.2-5
[58] doParallel_1.0.17 codetools_0.2-20 curl_6.2.3
[61] lattice_0.22-6 tibble_3.2.1 plyr_1.8.9
[64] shiny_1.10.0 treeio_1.32.0 withr_3.0.2
[67] coda_0.19-4.1 evaluate_1.0.3 phytools_2.4-4
[70] gridGraphics_0.5-1 heatmaply_1.5.0 xml2_1.3.8
[73] pillar_1.10.2 ggtree_3.16.0 DT_0.33
[76] foreach_1.5.2 ggfun_0.1.8 plotly_4.10.4
[79] generics_0.1.4 ggplot2_3.5.2 scales_1.4.0
[82] tidytree_0.4.6 xtable_1.8-4 bspm_0.5.7
[85] glue_1.8.0 scatterplot3d_0.3-44 lazyeval_0.2.2
[88] tools_4.5.0 dendextend_1.19.0 ggnewscale_0.5.1
[91] data.table_1.17.4 webshot_0.5.5 registry_0.5-1
[94] fs_1.6.6 XML_3.99-0.18 fastmatch_1.1-6
[97] grid_4.5.0 tidyr_1.3.1 ape_5.8-1
[100] crosstalk_1.2.1 seriation_1.5.7 colorspace_2.1-1
[103] nlme_3.1-167 patchwork_1.3.0 cli_3.6.5
[106] DEoptim_2.2-8 expm_1.0-0 viridisLite_0.4.2
[109] poppr_2.9.6 dplyr_1.1.4 gtable_0.3.6
[112] yulab.utils_0.2.0 sass_0.4.10 digest_0.6.37
[115] ggplotify_0.1.2 htmlwidgets_1.6.4 farver_2.1.2
[118] htmltools_0.5.8.1 lifecycle_1.0.4 pegas_1.3
[121] httr_1.4.7 mime_0.13 MASS_7.3-65
References
About
Contributors
The nf-core/pathogensurveillance pipeline was developed by: Zach Foster, Martha Sudermann, Camilo Parada-Rojas, Logan Blair, Fernanda Iruegas-Bocardo, Ricardo Alcalá-Briseño, Alexandra Weisberg, Jeff Chang and Nik Grünwald.
Funding
This pipeline is supported by NIFA grants 2021-67021-34433 and 2023-67013-39918 to JHC and NJG, and by ARS Project 2072-22000-045-000-D to NJG.
Contribute
To contribute, provide feedback, or report bugs, please visit our GitHub repository.
Citations
Please cite this pipeline and nf-core in publications as follows:
Foster et al. 2025. PathogenSurveillance: A nf-core pipeline for rapid analysis of pathogen genome data. In preparation.
Di Tommaso, Paolo, Maria Chatzou, Evan W Floden, Pablo Prieto Barja, Emilio Palumbo, and Cedric Notredame. 2017. Nextflow Enables Reproducible Computational Workflows. Nature Biotechnology 35: 316–19. https://doi.org/10.1038/nbt.3820.
Other tools
Icons for this report were sampled from Bootstrap Icons, Freepik, Iconify, Academicons, and Font Awesome, and the report was rendered in Quarto.